%# templat_name['prefix_dataset_id','basic_table','fastqc_check','fastqc_table','fastqc_graph','mapping_check','basic_map_table','mappable_ratio_graph','redundant_ratio_graph','peak_summary_table','DHS_ratio_graph','velcro_ratio_graph','venn_graph','correlation_graph','height_distibution_graph','gene_distibution_graph','conservation_graph']
%#  ------Prefix-------prefix_dataset_id

%- if section_name == "begin"
%\documentclass[11pt,a4paper]{article}
%\documentclass[twocolumn]{\VAR{bmcard}}  % uncomment this for twocolumn layout and comment line below

\documentclass{\VAR{bmcard}} % single-column layout (default); for two columns, use the commented twocolumn line above instead

\usepackage{tabularx}
\usepackage[english]{babel}
\usepackage{array}
\usepackage{graphicx}
\usepackage{color}
\usepackage{caption}

\DeclareGraphicsExtensions{.eps,.png,.pdf,.ps}

%\usepackage[paperwidth=15in, paperheight=30in]{geometry} %% margin=1.5in
%\usepackage{fullpage}
%\usepackage[cm]{fullpage}
%\usepackage[top=2in, bottom=1.5in, left=1in, right=1in]{geometry}

\usepackage{float}
\restylefloat{table}
\restylefloat{figure}
\usepackage[utf8]{inputenc} %unicode support

%%% Begin ...
\begin{document}

\begin{frontmatter}

\begin{fmbox}
\dochead{Software}
\title{ChIP-seq QC Report for Dataset ``\VAR{prefix_dataset_id}''}

\author[
   % addressref={aff1},                   % id's of addresses, e.g. {aff1,aff2}
   % corref={aff1},                       % id of corresponding address, if any
   % noteref={n1},                        % id's of article notes, if any
   email={xsliu.dfci@gmail.com}   % email address
]{\inits{Xiaole}\fnm{Shirley}\snm{Liu Lab}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%                                          %%
%% Enter the authors' addresses here        %%
%%                                          %%
%% Repeat \address commands as much as      %%
%% required.                                %%
%%                                          %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%

% \address[id=aff1]{%                           % unique id
%   \orgname{College of Life science and technology, Tongji}, % university, etc
%   \street{Siping Road},                     %
%   %\postcode{}                                % post or zip code
%   \city{Shanghai},                              % city
%   \cny{China}                                    % country
% }

% \begin{artnotes}
% \note[id=n1]{Equal contributor} % note, connected to author
% \end{artnotes}

\end{fmbox}% comment this for two column layout
\begin{abstractbox}

\begin{abstract} % abstract
\parttitle{ChIP-seq QC} %if any
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic locations of transcription-factor binding and histone modifications in living cells. Thousands of ChIP-seq datasets are generated by different labs. ChiLin quality control aims to score and evaluate the experimental and analysis quality of ChIP-seq data against our collected datasets. It includes raw data QC, read mapping QC, peak calling QC, and a series of annotation QC steps.
% Text for this section.
% \parttitle{Second part title} %if any
% Text for this section.
\end{abstract}
% \begin{keyword}
% \kwd{sample}
% \kwd{article}
% \kwd{author}
% \end{keyword}
\end{abstractbox}
\end{frontmatter}

% \vspace{-1cm}
% \maketitle
% \tableofcontents
% \setcounter{tocdepth}{2}
%- endif

%- if SummaryQC
\section*{ChIP-Seq Quality Control Summary Table}
There are 4642 datasets (2888 human) stored as historical data for quality comparison. Judgement criteria are set to assess data quality from different aspects, and cumulative distribution curves are drawn to show where your data fall. The shading colors pink, light golden, and pale green represent bad quality (fail), just OK (pass), and good quality (scoring above more than 50\% of the historical data), respectively. Please refer to https://docsmei.readthedocs.org/en/latest/ for detailed evaluation information.

The QC summary table (Table~\ref{summarytable}) below gives an integrated result.

\begin{table}[H]
\caption{QC Summary Table}\label{summarytable}
\begin{tabular}{lllcc}
\hline
QC item & Sample or Dataset & Score & Cutoff & Pass or Fail \\
\hline
\BLOCK{ for line in summary_table }
\VAR{line|join(' & ')} \\
\BLOCK{ endfor }
\hline
\end{tabular}
\end{table}
%- endif

%# ----- library contamination ----
%- if section_name == "library_contamination"
\section*{Reads Genomic Mapping QC measurement}
Modern high-throughput sequencers can generate tens of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, you should always perform some simple quality control checks to ensure that the raw data look good and that there are no problems or biases that may affect how you can use them.

\subsection*{Library contamination}
Because the quantity of immunoprecipitated DNA is typically very small, it is critically important that the researcher make every effort to avoid contaminating the sample. Contaminating sources can contribute significantly to the final amplified ChIP-seq library. In some extreme cases, samples may even be mislabeled during preparation.
To detect such problems, sample contamination QC is conducted, which reports the mappable ratio of each sample against the human, mouse, and rat genomes. The mappable ratios listed in the table give researchers more clues about library quality (Table~\ref{libcontamin}).
\begin{table}[H]
\caption{Library contamination}\label{libcontamin}
\begin{tabular}{lllc}
\hline
sample name & \VAR{library_contamination.meta.species|join(' & ')} \\
\hline
\BLOCK{ for k, v in library_contamination.value.items() }
\VAR{k} & \VAR{library_contamination.value[k].values() |join(' & ')} \\
\BLOCK{ endfor }
\hline
\end{tabular}
\end{table}
%- endif

%# -------------fastqc--------------fastqc_check : fastqc_table ; fastqc_graph
%- if section_name == "sequence_quality"
\subsection*{FastQC Summary table}
FastQC aims to provide a QC report that can spot problems originating in the sequencer.
The sequence quality score listed in the table lets you see whether a subset of your sequences has universally low quality values. Here we report the sequence quality scores of the new datasets and use our criteria to judge the quality of the new raw data. The sequence quality score of a sample is the value that more than 50\% of its reads' sequence quality exceeds. The higher the score, the better the raw data. We set 25 as the cutoff based on our collected datasets (Table~\ref{fastqctable}).

\begin{table}[H]
\caption{FastQC measurement}\label{fastqctable}
\begin{tabular}{lllc}
\hline
Sample name & Sequence length & Sequence quality score \\
\hline
\BLOCK{ for line in fastqc_table }
\VAR{line|join(' & ')} \\
\BLOCK{ endfor }
\hline
\end{tabular}
\end{table}

\subsection*{FastQC score distribution}
We draw the cumulative percentage plot of the sequence quality scores of all historical data and mark the sequence quality score of each of your new samples (Figure~\ref{fastqcplot}).
\begin{figure}[H]
\caption{FastQC score distribution plot} \label{fastqcplot}
\centering
{\includegraphics[scale=0.4]{\VAR{fastqc_graph}}}
\end{figure}
%- endif

%# --------basic mapping QC statistics--------
%- if section_name == "bowtie"
\subsection*{Basic mapping QC statistics}
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes.\footnote{Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25}
Example of a short DNA sequence read: AAAGGGCTGAGCTGAATGACTCAT.
Total reads: all reads sequenced in one ChIP-seq experiment.
Mappable reads: reads that can be aligned to the genome with at most 2 mismatches allowed.
Unique mappable reads: reads that map to only one location.
Unique mappable locations: locations to which only one read maps.
Unique reads ratio: the percentage of unique mappable reads among mappable reads. The larger the percentage, the better. For QC judgement, we set 5 million as the cutoff for unique mappable reads: if a dataset has more than 5 million unique mappable reads, it passes mapping QC (Table~\ref{basicqc}).
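\par As a compact restatement of the definitions above, and only as a sketch (the symbols $N_{\mathrm{unique}}$ and $N_{\mathrm{mappable}}$ are notation introduced here, not names used by the pipeline), the unique reads ratio and the mapping QC criterion can be written as
\[
  R_{\mathrm{unique}} = \frac{N_{\mathrm{unique}}}{N_{\mathrm{mappable}}},
  \qquad
  \mbox{pass if } N_{\mathrm{unique}} > 5 \times 10^{6},
\]
where $N_{\mathrm{unique}}$ is the number of unique mappable reads and $N_{\mathrm{mappable}}$ is the number of mappable reads.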
\begin{table}[h!]
\caption{Basic QC statistics} \label{basicqc}
\begin{tabular}{ lllc }
\hline
Sample & Total reads & Unique Mappable reads & Unique Mappable rate \\
\hline
\BLOCK{ for line in basic_map_table }
\VAR{line|join(' & ')} \\
\BLOCK{ endfor }
\hline
\end{tabular}
\end{table}

%# -------mappable reads ratio---------
\subsection*{Mappable reads ratio}
The following figure shows the cumulative distribution of the mappable rates of all the ChIP-seq data in our data collection program. The x axis is the mappable rate, and the y axis is the cumulative percentage of datasets in the collection. Given new datasets, we mark their locations among all the data. The mappable rate equals mapped reads divided by total reads; the higher this rate, the better the ChIP-seq experiment and the higher the data quality (Figure~\ref{fig:mappinratio}).
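\par Written out, with $N_{\mathrm{mapped}}$ and $N_{\mathrm{total}}$ as notation introduced here for the mapped and total read counts, the mappable rate is
\[
  R_{\mathrm{mappable}} = \frac{N_{\mathrm{mapped}}}{N_{\mathrm{total}}}.
\]
As a purely hypothetical example, a sample with $8 \times 10^{6}$ mapped reads out of $10^{7}$ total reads would have $R_{\mathrm{mappable}} = 0.8$.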
\begin{figure}[H]
\caption{Mappable reads ratio} \label{fig:mappinratio}
\centering
{\includegraphics[scale=0.4]{\VAR{mappable_ratio_graph}}}
\end{figure}
%- endif

%# -------mappable redundant rate-----------
%- if section_name == "redundant"
\subsection*{Mappable Non-Redundant rate}
This figure shows the cumulative distribution of the mappable non-redundant ratio of all the ChIP-seq data in our data collection program. The x axis is the non-redundant ratio, and the y axis is the cumulative percentage of datasets in the collection. Given new datasets, we mark their locations among all the data. The higher this ratio, the higher the data quality. We set 0.8 as the cutoff for the non-redundant rate (Figure~\ref{fig:uni}).
\begin{figure}[H]
\caption{Mappable Non-Redundant rate} \label{fig:uni}
\centering
{\includegraphics[scale=0.4]{\VAR{redundant_ratio_graph}}}
\end{figure}
%- endif

%# ---------peak calling QC---------- peak_summary_table ; DHS_ratio_graph ; velcro_ratio_graph

%- if section_name == "high_confident_peaks"
\section*{Peak calling QC measurement}
\subsection*{Peak calling summary}
This report aims to show the quality of the peaks called by MACS. The peak summary, peak distribution, high-confidence peaks, peaks overlapping with DHS, and peaks overlapping with Velcro sites (human only) are used to evaluate the peak calling result.
The `False Discovery Rate' column shows the Q-value cutoff (default 0.01) used when running MACS. The lower the Q-value cutoff, the more confident the peaks MACS calls.
The `Peak count' column shows the total number of peaks called at that Q-value cutoff.
The `peaks $\geq$ 10FC' column refers to high-confidence peaks: it shows the number of peaks whose fold enrichment of tags on the genome is $\geq$ 10. The logarithm of this number is used for the QC judgement.
The `Shift size' column shows the distance between the location of Watson- or Crick-strand tag enrichment and the location of the binding site. Here we set 1000 as the cutoff for the total number of peaks at the default Q-value and 3 as the cutoff for the $\log_{10}$ of the number of high-confidence peaks (Table~\ref{peaksum}).
\begin{table}[H]
  \caption{Peak summary table} \label{peaksum}
\begin{tabular}{ lllcc }
\hline
Run name & False Discovery Rate & Peak count & peaks $\geq$ 10FC & Shift size \\
\hline
\VAR{peak_summary_table|join(' & ')} \\
\hline
\end{tabular}
\end{table}

%#---------- Hight confident Peak
\subsection*{High confident Peak }
The figure shows the distribution of the base-10 logarithm of the number of peaks with fold enrichment $\geq$ 10. For datasets with zero 10-fold-enrichment peaks, the number is set to 0.1. The cutoff is 3. A dataset located to the left of the cutoff is not considered to have sufficient high-confidence peaks; a dataset located to the right is. In other words, if the number of 10-fold-enrichment peaks is $\geq$ 1000, we can safely say the dataset has abundant high-confidence peaks (Figure~\ref{highconfipeak}).
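\par The cutoff can be sketched as a formula, writing $N_{10}$ (a symbol introduced here for illustration) for the number of peaks with fold enrichment $\geq$ 10:
\[
  s = \log_{10}\bigl(\max(N_{10},\, 0.1)\bigr),
  \qquad
  \mbox{pass if } s \geq 3, \mbox{ that is, } N_{10} \geq 1000 .
\]
A dataset with no such peaks therefore gets $s = \log_{10} 0.1 = -1$.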
\begin{figure}[H]
\caption{High confident Peak distribution} \label{highconfipeak}
\centering
{\includegraphics[scale=0.4]{\VAR{high_confident_peak_graph}}}
\end{figure}
%- endif

%#-------Peaks overlapped with DHS----------
%- if section_name == "DHS_velcro"
\subsection*{Peaks overlapped with DHS (DNase I hypersensitive sites)}
In eukaryotes, transcription is regulated in a cell-type and condition-specific manner through the association of transcription factors with chromatin. The genome-wide binding sites of transcription factors are influenced by the active protein levels of the transcription factors, chromatin structure, and DNA sequence.
DNase I hypersensitivity is an alternative measure of chromatin accessibility (Wu 1980). DNase I hypersensitive sites (DHS), short regions of chromatin that are highly sensitive to cleavage by DNase I, typically occur in nucleosome-free regions and frequently arise as a result of transcription factor binding. DNase I digestion followed by high-throughput sequencing (DNase-seq) has evolved into a powerful technique for identifying genome-wide DNase hypersensitive sites.\footnote{Ling et al. 2010; John et al. 2011; Siersbaek et al. 2011}
We draw a CDF curve of the overlap with the union DHS sites for all DC data, so that we can assess the confidence level of the TF binding sites described by your peak file. We set 0.8 as the cutoff to judge whether the peaks' overlap with DHS passes our QC criterion (Figure~\ref{DHS}).
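\par As a sketch, assuming the DHS ratio is the fraction of peaks that overlap at least one union DHS site (the symbols below are introduced here for illustration, and the exact overlap rule and the handling of a value exactly at the cutoff are assumptions):
\[
  R_{\mathrm{DHS}} = \frac{N_{\mathrm{DHS}}}{N_{\mathrm{peaks}}},
  \qquad
  \mbox{pass if } R_{\mathrm{DHS}} \geq 0.8 ,
\]
where $N_{\mathrm{DHS}}$ is the number of peaks overlapping a union DHS site and $N_{\mathrm{peaks}}$ is the total number of peaks.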
\begin{figure}[H]
\begin{center}
\caption{Peaks overlapped with DHS data} \label{DHS}
{\includegraphics[scale=0.4]{\VAR{DHS_ratio_graph}}}
\end{center}
\end{figure}
%\newpage

%#------- reads enrichment -----------
%- if section_name == "read_enrichment_check"
\subsection*{Read enrichment}
To find reliable transcription factor binding sites, experimenters usually conduct a control experiment. Read enrichment QC presents read enrichment across the genome and compares the ChIP experiment with the control experiment. Normally, read enrichment across the genome is smooth in the control experiment but very sharp in the ChIP experiment. Here we choose DHS regions as the genomic background and calculate read coverage over these DHS sites. The following graph presents read coverage over DHS sites for the ChIP and control experiments (Figure~\ref{fig:Read_enrichment}).
\begin{figure}
        \caption{read enrichment in DHS} \label{fig:Read_enrichment}
        \centering
        {\includegraphics[scale=0.4]{\VAR{read_enrichment_graph}}}
\end{figure}
%- endif

%# -------peaks overlaped with Velcro---------
\subsection*{Non-Velcro ratio (human only)}
There is a comprehensive set of regions in the human genome that have anomalous, unstructured, high signal or read counts in next-generation sequencing experiments, independent of cell line and type of experiment. We call them Consensus Signal Artifact Regions (velcro regions). The breadth of cell lines covered by the ENCODE datasets allows us to identify them in a systematic manner.
We use 80 open chromatin tracks (DNase and FAIRE datasets) and 12 ChIP-seq input/control tracks spanning $\sim$60 cell lines in total to identify $\sim$400 Consensus Signal Artifact Regions. We judge the quality and confidence level of the TF binding sites by their overlap with these regions. If more than 5\% of the binding sites of a TF overlap with these regions, the TF binding sites may not be reliable.
We draw a CDF curve of the non-velcro ratio for all DC data, so that we can assess the confidence level of the TF binding sites described by your peak file. For the non-velcro ratio, we set 0.9 as the QC cutoff (Figure~\ref{fig:velcro}).
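\par As a sketch, assuming the non-velcro ratio is the complement of the fraction of peaks that overlap a velcro region (the symbols below are introduced here for illustration, and the handling of a value exactly at the cutoff is an assumption):
\[
  R_{\mathrm{nv}} = 1 - \frac{N_{\mathrm{velcro}}}{N_{\mathrm{peaks}}},
  \qquad
  \mbox{pass if } R_{\mathrm{nv}} \geq 0.9 ,
\]
where $N_{\mathrm{velcro}}$ is the number of peaks overlapping a velcro region and $N_{\mathrm{peaks}}$ is the total number of peaks. Under this reading, the cutoff of 0.9 corresponds to at most 10\% of peaks overlapping velcro regions.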
\begin{figure}[H]
\caption{Non-Velcro ratio} \label{fig:velcro}
\centering
{\includegraphics[scale=0.4]{\VAR{velcro_ratio_graph}}}
\end{figure}
%- endif

%# -------replicate QC-------replicate_check : venn_graph ; correlation_graph
%- if section_name == "venn"
\subsection*{Profile correlation within union peak regions}
If replicate ChIP-seq experiments are available, the pipeline draws the correlation plot. The score is the correlation among the replicates' profiles. Replicate quality is judged as good if the correlation is higher than 0.8 (Figure~\ref{fig:profileunion}).
\begin{figure}[H]
        \caption{Peaks Overlap correlation diagram between Replicates} \label{fig:profileunion}
        \centering
        {\includegraphics[scale=0.4]{\VAR{venn_graph}}}
\end{figure}
%- endif

%- if section_name == "correlation"
\subsection*{Peaks overlap between Replicates}
If replicate ChIP-seq experiments are available, the pipeline draws the Venn plot. If peaks from different replicates overlap by more than 1 bp, we count them as one overlap. The larger the overlap among replicates, the higher the quality of the replicate experiments (Figure~\ref{fig:venn}).
\begin{figure}[H]
        \caption{Peaks Overlap venn diagram between Replicates} \label{fig:venn}
        \centering
        {\includegraphics[scale=0.4]{\VAR{correlation_graph}}}
\end{figure}
%- endif

%#---------Ceas QC---------
%- if section_name=='ceas'
\section*{Functional Genomic QC measurement}
\subsection*{Peak Height distribution}
According to peak location, all peaks are separated into several categories: promoter regions, introns, distal intergenic regions, and so on. In addition, a cumulative plot of peak fold change is used to evaluate peak quality and distribution.
\begin{figure}[H]
        \caption{Peak height distribution} \label{Peakdist}
        \centering
        {\includegraphics[scale=0.5]{\VAR{meta_gene_graph}}}
\end{figure}

\subsection*{Meta Gene distribution}
Adding up the read density over all meta genes (in contrast to the whole gene, a meta gene indicates a single gene unit) yields the ``average profile''. Usually there is a peak near the TSS (transcription start site).
Checking these profiles gives us insight into the average read location near the TSS, the TTS, and across the whole genome (Figure~\ref{Meta}).
\begin{figure}[H]
        \caption{Meta Gene distribution} \label{Meta}
        \centering
        {\includegraphics[scale=0.5]{\VAR{gene_distribution_graph}}}
\end{figure}
%- endif

%# ---------Conservation QC----------- conservation_graph
%- if section_name == "conservation"
\subsection*{Peak conservation score} 
Conserved sequences are similar or identical sequences across species that have been maintained by evolution despite speciation. Highly conserved sequences are thought to have conserved functional value. The conservation plot is used to judge the conservation of transcription factor binding sites or histone modification sites. For transcription factors, a window of 500 bp upstream and downstream of each peak summit is used; for histone modifications, the window is 2000 bp. Using k-means clustering, we separate the conservation plots of our historical datasets into several patterns and choose the pattern most similar to your dataset's as the basis for judgement (Figure~\ref{fig:conservation_compare}).
% \begin{figure}[H]
%         \caption{Phascon conservation distribution} \label{fig:conservation}
%         \centering
%         {\includegraphics[scale=0.5]{\VAR{conservation_graph}}}
% \end{figure}
\begin{figure}[H]
        \caption{PhastCons conservation distribution of original group and compared group} \label{fig:conservation_compare}
        \centering
        {\includegraphics[scale=0.4]{\VAR{conservation_compare_graph}}}
\end{figure}
%- endif

%- if section_name == "motif"
%# -----------motif QC---------------
\subsection*{Motif QC measurement analysis}
Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF).\footnote{D'haeseleer P: What are DNA sequence motifs? Nature Biotechnology 2006, 24.}
We scan the top 1000 peak regions found by MACS to detect the motifs of likely binding factors.
Name: the factor name corresponding to the motif sequence logo.
Hits: the number of times the motif occurs in the top 1000 regions. The more often it occurs, the more credible it is.
Z-score: the value should preferably be less than $-15$. The smaller it is, the more credible the motif.
Logo: the letter height represents reliability and information content. If it is very low, we cannot be sure which base it is.
We might find motifs of factors other than the one targeted by the antibody. For example, in an AR ChIP-seq experiment, the motif finding results might also include FoxA1, suggesting that FoxA1 might be a co-factor of AR.
We search the Cistrome database and use de novo motif finding methods to generate predictions of the experimental factor's motifs.
To avoid redundant motif predictions and excessive similarity between de novo motifs and previously identified ones, we use a function to merge replicated and redundant motifs. In addition, only credible motifs with a Z-score under $-15$ are kept, so if no significant motif is detected, the motif table is not displayed in the QC report.
For TFs, we set a QC judgement: if the experimental transcription factor is detected in the motif table, the dataset passes motif QC (Table~\ref{motif}).

\begin{table}[h!]
        \caption{Seqpos QC measurement} \label{motif}
\begin{tabular}{llccp{2.8in}}
\hline
Factor name & Z-score & Hits  & Motif logo \tabularnewline
\hline
\BLOCK{ for line in motif_table }
*\VAR{line.factors|join(' ')} & \VAR{line.zscore} & \VAR{line.hits} & \parbox[l]{0.7em}{\includegraphics[scale=0.6]{\VAR{line.logoImg}}} \\
\BLOCK{ endfor }
\hline
\end{tabular}
\end{table}
%- endif

%# Summary QC -----------
%- if section_name == "ending"
\end{document}
%- endif
